NYC is a hub of diversity, yet there is one similarity that 8.5 million people shares – everyone eats. Over 26,000 eating establishments in the city (according to the NYC Department of Health and Mental Hygiene or DOHMH) thrives here because New Yorkers dine out over 58% of their lunches or dinners and spend $46.14 on average for each meal per person (Zagat Dining Trends Survey 2018). For these customers, ourselves included, the food that we purchase so frequently and at such high costs is a major factor influencing our health. To help people stay informed about food safety and the potential health risks that eating out poses, we conducted our research about restaurants in NYC based on DOHMH’s source data hosted on Open Data.
For better interactivity, we used plotly for most of our graphs and include a d3 visualization for missing data.
\(Team\) \(Members:\) Cindy Wu, Julie Yu, Lisa Kim, Selina Tang
Cindy- data description, data quality, time series overview
Julie- grade distributions and analysis
Lisa- violation distributions and analysis
Selina- restaurants case by case
Our dataset is collected from NYC Open Data along with a data description provided by the website
375,989 rows of data entries of NYC restaurant inspection information between June 2012 - April 2018
Each row consists of 18 variables including (and not limited to) restaurant information (name, address, zipcode, borough, cuisine type), inspection results (inspection type, inspection date), violation details (violation code with corresponding violation description), and scorings (score, letter grade). For the sole purpose of efficiently visualizing inspection results, we only selected the most relevant and most representative 10 variables in the dataset to conduct further exploration. The ten vairbales this report focuses on are - DBA (restaurant name), borough, zipcode, cuisine description,inspection date,violation code, violation description, score, grade, and inspection type
Since restaurants go in and out of business, the dataset only records restaurants that are still active in March 2018 (last inspection included in the data)
click here to see interactive visna plots made with D3
We first explore the distribution of missing data in the original dataset. As shown in the chart and plot, the most common missing pattern in the original data is Grade on its own with over 50% missing data, followed by only around 5% in scores and minor missing percentages in violation details.
## # A tibble: 10 x 4
## variable type value percent_missing
## <chr> <chr> <dbl> <dbl>
## 1 GRADE character 188585 50.2
## 2 SCORE integer 20466 5.45
## 3 VIOLATION.DESCRIPTION character 7106 1.89
## 4 VIOLATION.CODE character 6010 1.60
## 5 INSPECTION.TYPE character 1108 0.295
## 6 DBA character 419 0.112
## 7 BORO character 0 0
## 8 ZIPCODE character 0 0
## 9 CUISINE.DESCRIPTION character 0 0
## 10 INSPECTION.DATE character 0 0
From the official grading system in data description, we learned that the grades are converted from scores accroding to a specific scheme for every ‘gradable’ inspection, and only inspections of certain types are eligble to receive a grade. Due to the large amount of missing values in Grade, we pay more attention to scores and violation details since they are good and direct reflections of inspection results and contain much fewer missing data.
In addition to grades, the data description also states that some scores are missing because there are new restarants that have yet to be inspected, and they are marked by “inspection.date = 01/01/1900”. Therefore, we removed rows with new restaurants’ and looked at missing values again.
The second chart shows the missing value percentages for each variable after removing new restaurants. We see a decrease in missing scores and for our exploratory analysis, we removed all missing values in scores.
## [1] "Number of new restaurants: 1108"
## # A tibble: 10 x 4
## variable type value percent_missing
## <chr> <chr> <dbl> <dbl>
## 1 GRADE character 187478 50.1
## 2 SCORE integer 19358 5.17
## 3 VIOLATION.DESCRIPTION character 5998 1.60
## 4 VIOLATION.CODE character 4902 1.31
## 5 DBA character 0 0
## 6 BORO character 0 0
## 7 ZIPCODE character 0 0
## 8 CUISINE.DESCRIPTION character 0 0
## 9 INSPECTION.DATE character 0 0
## 10 INSPECTION.TYPE character 0 0
The third chart summarizes the final data we used for the report. Although there are still many missing values in grades, we do not consider it bad data quality since they are intentionally left blank.The overall quality of the dataset seems quite promising at this point.
## # A tibble: 10 x 4
## variable type value percent_missing
## <chr> <chr> <dbl> <dbl>
## 1 GRADE character 168130 47.4
## 2 VIOLATION.DESCRIPTION character 1934 0.545
## 3 VIOLATION.CODE character 1492 0.420
## 4 DBA character 0 0
## 5 BORO character 0 0
## 6 ZIPCODE character 0 0
## 7 CUISINE.DESCRIPTION character 0 0
## 8 INSPECTION.DATE character 0 0
## 9 SCORE integer 0 0
## 10 INSPECTION.TYPE character 0 0
To ensure data quality and validify our further analysis, we inspect the data more closely based on two variables we are particularly interested in- boroughs and inspection types.
# By borough
percent_missing1 <- df_1 %>% group_by(BORO) %>%
summarize(num_Restaurants = n(), num_Missing_Score = sum(is.na(`SCORE`))) %>%
mutate(percent_Missing_Score = round(num_Missing_Score/num_Restaurants, 2)) %>%
arrange(-percent_Missing_Score)
p1<- ggplot(percent_missing1)+
geom_line(aes(x= BORO, y =percent_Missing_Score, group=1))+
geom_point(aes(x= BORO, y =percent_Missing_Score),color = "hotpink")+
ylim(0,0.15)+
ggtitle("Missing data percentage by borough")
p1 <- ggplotly(p1)
p1
# By inspection type
percent_missing <- df_1 %>% group_by(INSPECTION.TYPE) %>%
summarize(num_Restaurants = n(), num_Missing_Score = sum(is.na(`SCORE`))) %>%
mutate(percent_Missing_Score = round(num_Missing_Score/num_Restaurants, 2)) %>%
arrange(-percent_Missing_Score)
p2 <- ggplot(percent_missing)+
geom_line(aes(x= INSPECTION.TYPE, y =percent_Missing_Score, group=1))+
geom_point(aes(x= INSPECTION.TYPE, y =percent_Missing_Score),color = "blue")+
ggtitle("Missing data percentage by Inspection Type")
p2 <- ggplotly(p2)
p2
Each borough has very similar percentage in missing values so it is fair to conclude that the data qulity of our dataset is identical across the city.
The following time series plot gives a general overview of NYC restaurant cleanliness over the past five year.
df <- df[!duplicated(df[c('DBA', 'INSPECTION.DATE', 'GRADE')]),]
df$INSPECTION.DATE <- as.Date(as.character(df$INSPECTION.DATE),format="%m/%d/%Y")
ts <- xts(df$SCORE,df$INSPECTION.DATE)
quarterly <- apply.quarterly(ts, FUN =mean)
ts_df<- data.frame(date=index(ts),score = ts)
quarterly_df <- data.frame(date = index(quarterly), quarter_avg = quarterly)
p<-plot_ly(quarterly_df[quarterly_df$date> "2013-05-16" & quarterly_df$date < "2018-03-31" ,])%>%
add_lines( x=~date, y=~quarter_avg,
type = 'scatter',
mode = 'lines+markers',
hoverinfo = 'text',
text = ~paste("Quarter Date:",date,"<br> Average Score: ",round(quarter_avg)))%>%
add_trace(x = ~date,y = ~quarter_avg, mode = 'markers',color=I("orange"),marker = list(size = 8))%>%
layout(title = "Quarter Average Score",showlegend = FALSE)
p
There is a clear pattern that the third quarter (Jul - Sept) has the highest average score each year. Thinking in terms of season and weather, since July, August and September are typically the hottest months in NYC, we can infer that there are more violations during the summer. In the next sections, we go into more details about inspection results and look for more interesting patterns from two main perspectives- grade and violations.
In this section, we will be focusing on the grades the restaurants in NYC received. Like before, we exclued those with dates = 1/1/1900, which indicated that they’re new restaurants yet to be inspected. We also excluded rows without a score since those without a score would also have no grade and that grades are determined based on scores. And we’re keeping missing grades as of now for further investigation later on.
According to the manual of this data set, only inspections of the type in the following four categories are gradable: - Cycle Inspection/Initial Inspection - Cycle Inspection/Re-Inspection - Pre-Permit (Operational)/Initial Inspection - Pre-Permit (Operational)/Re-Inspection) Therefore, in this section, we filtered out restaurants that have inspection types not from the above four categories. Also, for the following studies, we will exclude grades that are “Not Yet Graded”, “P”, or “Z” which indicate pending grades under different circumstances.
In addition, a restaurant can receive multiple violtaions within the same day of an inspection, resulting in multiple rows in the data set. Therefore, the duplicated grades given to a particular restaurant on the same day should be only counted once.
We first looked at the distribution of grades for A, B, and C and excluding missing values.
ggplot(filter(df,grade=='A'|grade=='B'|grade=='C'), aes(x=grade)) + geom_bar() + labs(title="Restaurant Grades")
We then looked at how many missing values there are in grades after eliminating all the duplicated entries.
skimr::skim(df$grade)
## Skim summary statistics
##
## Variable type: character
## variable missing complete n min max empty n_unique
## df$grade 0 118571 118571 0 14 41720 6
According to the summary statistics above, we see that there are many inspections that have no grades (ie empty values). We then checked the manual to try to figure out why there are missing values and ways to deal with them. According to the manual, for restaurants that didn’t receive a score less than 14 (which corresponded to an A) during their initial inspection, they would be given another chance for re-inspection. And a grade would not be given until re-inspection. In order to capture this information and conduct some futher studies, we inputed the missing grades by converting scores to grades according to NYC Department of Health and Mental Hygiene’s (DOHMH) grading rules.
convert_to_grade <- function(x){
if (x < 14){
return("A")
}
else if(x > 28){
return("C")
}
else {
return("B")
}
}
df$imp_grade <- sapply(df$score, convert_to_grade)
We’re now interested in whether there is a difference in the grade distribution between simply looking at grades given by NYC DOHMH as they are and grades that include information of initial inspections.
p1 <- ggplot(filter(df,grade=='A'|grade=='B'|grade=='C'), aes(x=grade)) + geom_bar() + labs(title="Restaurant Grades")
p2 <- ggplot(filter(df,(imp_grade=='A'|imp_grade=='B'|imp_grade=='C')&(inspection_type=="Cycle Inspection / Initial Inspection" | inspection_type== "Pre-permit (Operational) / Initial Inspection")), aes(x=imp_grade)) + geom_bar() + labs(title="Restaurant Grades at Initial Inspections",x="grade")
grid.arrange(p1, p2, nrow = 1)
From the above graphs, we can see that the distributions are quite different. If we had only simply looked at the grades given as they were (graph on the left), the percentage of A restaurants would be pretty high. However, looking at the graph on the right tells us that actually many restaurants did not receive an A at their initial inspection.
Since the above graph on the left consists of both initial inspection and re-inspection grades without missing grades while the graph on the right consists of only initial inspection with imputed grades, it could be hard to make a direct comparison. Therefore, we introduced two variables explained below and made a mosaic plot for smoother transitions from one to another for easier comparison, which also incoporates information about the different sample sizes.
Note that, non-imputed data are just a subset of the imputed data and also initial inspection and re-inspection combined would belong to the “both” category for inspection. We did this because looking at the “both” category is the natural way we’d first look at the data and it’d be hard to imagine this category and doing the math in our head ourselves if we’re to only give the “initial” and “re” categories.
df1 <- filter(df, grade=='A'|grade=='B'|grade=='C') %>% select(c('inspection_type','grade')) %>% group_by(grade) %>% summarise(Freq=n())
df1$imputation <- c("no")
df1$inspection <- c("both")
colnames(df1) <- c("grade","Freq","imputation","inspection")
#summarise(all_insp_df,sum(Freq))
df2 <- filter(df, grade=='A'|grade=='B'|grade=='C') %>% select(c('inspection_type','grade')) %>% filter(inspection_type=="Cycle Inspection / Initial Inspection" | inspection_type== "Pre-permit (Operational) / Initial Inspection") %>% group_by(grade) %>% summarise(Freq=n())
df2$imputation <- c("no")
df2$inspection <- c("initial")
colnames(df2) <- c("grade","Freq","imputation","inspection")
#summarise(all_insp_df,sum(Freq))
df3 <- filter(df, grade=='A'|grade=='B'|grade=='C') %>% select(c('inspection_type','grade')) %>% filter(inspection_type=="Cycle Inspection / Re-inspection" | inspection_type== "Pre-permit (Operational) / Re-inspection") %>% group_by(grade) %>% summarise(Freq=n())
df3$imputation <- c("no")
df3$inspection <- c("re")
colnames(df3) <- c("grade","Freq","imputation","inspection")
#summarise(all_insp_df,sum(Freq))
df4 <- df %>% select(c('inspection_type','imp_grade')) %>% group_by(imp_grade) %>% summarise(Freq=n())
df4$imputation <- c("yes")
df4$inspection <- c("both")
colnames(df4) <- c("grade","Freq","imputation","inspection")
#summarise(imp_insp_df,sum(Freq))
df5 <- df %>% select(c('inspection_type','imp_grade')) %>% filter(inspection_type=="Cycle Inspection / Initial Inspection" | inspection_type== "Pre-permit (Operational) / Initial Inspection") %>% group_by(imp_grade) %>% summarise(Freq=n())
df5$imputation <- c("yes")
df5$inspection <- c("initial")
colnames(df5) <- c("grade","Freq","imputation","inspection")
#summarise(init_insp_df,sum(Freq))
df6 <- df %>% select(c('inspection_type','imp_grade')) %>% filter(inspection_type=="Cycle Inspection / Re-inspection" | inspection_type== "Pre-permit (Operational) / Re-inspection") %>% group_by(imp_grade) %>% summarise(Freq=n())
df6$imputation <- c("yes")
df6$inspection <- c("re")
colnames(df6) <- c("grade","Freq","imputation","inspection")
#summarise(re_insp_df,sum(Freq))
df7 <- rbind(df1,df2,df3,df4,df5,df6)
df7 <- droplevels(df7)
df7$imputation <- as.factor(df7$imputation)
df7$inspection <- as.factor(df7$inspection)
vcd::mosaic(grade ~inspection+imputation,df7,main="Grade Distribution vs Inspection Type vs Imputation")
Note that when fixing inspection=re, there is slight difference before and after imputation when there shouldn’t be because, according to the manual, it’s possible that a record of the grade is simply missing in the data even if the grade was issued. And as we’d expect, without imputation (filling in missing grades), the initial inspections only have A grades. Fixing inspection=initial, we see a big difference when we impute the data – there are about 40% of restaurants that actually did not receive an A during their first inspection. As the way the data was given to us (“both”+“none” categories), if we simply ignore the missing values, we’d assume that a high percentage of restaurants in NYC have an A. However, as shown in the graph, this is not true if we take into account the initial inspection grades (ie “yes” category). So this tells us that we shouldn’t just trust that almost all restaurants in NYC have been always clean and healthy. But also, we can see that the enforcement of inspections can be effective in encouraging the restaurants to improve their cleanliness and health levels.
We now shift our focus to examining if certain types of restaurants are cleaner and healthier than ther others. Since New York City requires restaurant to post their grades and the customers are often more familiar with this grading system, we decided to look at grades instead of scores. Below are tables showing top 20 restaurant types with the highest percentage of receiving an A (count of A’s/total counts of A+B+C within its own category) and a C. And for this study, we decided to use grades from both initial insepctions and re-inspections excluding missing grades since these are the actual grades the restaurants will post.
df_cuisine <- df[,c('name','cuisine','grade')]
df_cuisine <- df_cuisine %>% filter(grade=='A'|grade=='B'|grade=='C')
df_cuisine <- df_cuisine %>% group_by(cuisine) %>% summarise(count_A=sum(grade == "A", na.rm = T),count_B=sum(grade == "B", na.rm = T),count_C=sum(grade == "C", na.rm = T))
levels(df_cuisine$cuisine)[levels(df_cuisine$cuisine)=="Café/Coffee/Tea"] <- "Cafe/Coffee/Tea" #for better display
df_cuisine <- df_cuisine %>% group_by(cuisine) %>% mutate(total_count=count_A+count_B+count_C)
df_cuisine <- df_cuisine %>% group_by(cuisine) %>% mutate(percentage_A=count_A/total_count)
df_cuisine <- df_cuisine %>% group_by(cuisine) %>% mutate(percentage_B=count_B/total_count)
df_cuisine <- df_cuisine %>% group_by(cuisine) %>% mutate(percentage_C=count_C/total_count)
kable(arrange(df_cuisine, desc(`percentage_A`))[1:20,],caption = "Top 20 A Percentages")
| cuisine | count_A | count_B | count_C | total_count | percentage_A | percentage_B | percentage_C |
|---|---|---|---|---|---|---|---|
| Basque | 1 | 0 | 0 | 1 | 1.0000000 | 0.0000000 | 0.0000000 |
| Cajun | 30 | 0 | 0 | 30 | 1.0000000 | 0.0000000 | 0.0000000 |
| Czech | 10 | 0 | 0 | 10 | 1.0000000 | 0.0000000 | 0.0000000 |
| Fruits/Vegetables | 16 | 0 | 0 | 16 | 1.0000000 | 0.0000000 | 0.0000000 |
| Not Listed/Not Applicable | 22 | 0 | 0 | 22 | 1.0000000 | 0.0000000 | 0.0000000 |
| Nuts/Confectionary | 13 | 0 | 0 | 13 | 1.0000000 | 0.0000000 | 0.0000000 |
| Scandinavian | 14 | 0 | 0 | 14 | 1.0000000 | 0.0000000 | 0.0000000 |
| Soups | 12 | 0 | 0 | 12 | 1.0000000 | 0.0000000 | 0.0000000 |
| Soups & Sandwiches | 143 | 3 | 0 | 146 | 0.9794521 | 0.0205479 | 0.0000000 |
| Other | 359 | 9 | 3 | 371 | 0.9676550 | 0.0242588 | 0.0080863 |
| Portuguese | 29 | 1 | 0 | 30 | 0.9666667 | 0.0333333 | 0.0000000 |
| Pancakes/Waffles | 56 | 1 | 1 | 58 | 0.9655172 | 0.0172414 | 0.0172414 |
| Salads | 183 | 7 | 0 | 190 | 0.9631579 | 0.0368421 | 0.0000000 |
| Hotdogs | 76 | 3 | 0 | 79 | 0.9620253 | 0.0379747 | 0.0000000 |
| Ice Cream, Gelato, Yogurt, Ices | 787 | 27 | 6 | 820 | 0.9597561 | 0.0329268 | 0.0073171 |
| Café/Coffee/Tea | 3944 | 150 | 31 | 4125 | 0.9561212 | 0.0363636 | 0.0075152 |
| Steak | 271 | 9 | 4 | 284 | 0.9542254 | 0.0316901 | 0.0140845 |
| Sandwiches/Salads/Mixed Buffet | 598 | 28 | 4 | 630 | 0.9492063 | 0.0444444 | 0.0063492 |
| Hamburgers | 1055 | 48 | 10 | 1113 | 0.9478886 | 0.0431267 | 0.0089847 |
| Donuts | 1114 | 55 | 8 | 1177 | 0.9464741 | 0.0467290 | 0.0067969 |
kable(arrange(df_cuisine, desc(`percentage_C`))[1:20,],caption = "Top 20 C Percentages")
| cuisine | count_A | count_B | count_C | total_count | percentage_A | percentage_B | percentage_C |
|---|---|---|---|---|---|---|---|
| Polynesian | 2 | 1 | 1 | 4 | 0.5000000 | 0.2500000 | 0.2500000 |
| Iranian | 8 | 0 | 1 | 9 | 0.8888889 | 0.0000000 | 0.1111111 |
| Bangladeshi | 107 | 21 | 8 | 136 | 0.7867647 | 0.1544118 | 0.0588235 |
| Australian | 51 | 2 | 2 | 55 | 0.9272727 | 0.0363636 | 0.0363636 |
| Creole | 63 | 18 | 3 | 84 | 0.7500000 | 0.2142857 | 0.0357143 |
| Peruvian | 198 | 36 | 8 | 242 | 0.8181818 | 0.1487603 | 0.0330579 |
| African | 174 | 52 | 7 | 233 | 0.7467811 | 0.2231760 | 0.0300429 |
| Asian | 876 | 113 | 27 | 1016 | 0.8622047 | 0.1112205 | 0.0265748 |
| Korean | 672 | 110 | 21 | 803 | 0.8368618 | 0.1369863 | 0.0261519 |
| Soul Food | 131 | 20 | 4 | 155 | 0.8451613 | 0.1290323 | 0.0258065 |
| Latin (Cuban, Dominican, Puerto Rican, South & Central American) | 2460 | 448 | 77 | 2985 | 0.8241206 | 0.1500838 | 0.0257956 |
| Moroccan | 33 | 5 | 1 | 39 | 0.8461538 | 0.1282051 | 0.0256410 |
| Vegetarian | 284 | 25 | 8 | 317 | 0.8958991 | 0.0788644 | 0.0252366 |
| Indian | 838 | 113 | 24 | 975 | 0.8594872 | 0.1158974 | 0.0246154 |
| Spanish | 1579 | 264 | 45 | 1888 | 0.8363347 | 0.1398305 | 0.0238347 |
| Caribbean | 1761 | 289 | 50 | 2100 | 0.8385714 | 0.1376190 | 0.0238095 |
| Tapas | 83 | 6 | 2 | 91 | 0.9120879 | 0.0659341 | 0.0219780 |
| Mexican | 2374 | 282 | 57 | 2713 | 0.8750461 | 0.1039440 | 0.0210100 |
| Middle Eastern | 439 | 38 | 10 | 487 | 0.9014374 | 0.0780287 | 0.0205339 |
| Pakistani | 82 | 14 | 2 | 98 | 0.8367347 | 0.1428571 | 0.0204082 |
tidyv_cuisine <- df_cuisine[,c('cuisine','percentage_A','percentage_B','percentage_C','total_count')]
tidyv_cuisine <- tidyv_cuisine %>% rename(A = percentage_A,B = percentage_B,C = percentage_C)
tidyv_cuisine_top_counts <- arrange(tidyv_cuisine, desc(`total_count`))[1:20,]
tidyv_cuisine_top_counts$cuisine <- factor(tidyv_cuisine_top_counts$cuisine,
levels=tidyv_cuisine_top_counts$cuisine[order(tidyv_cuisine_top_counts$A)])
levels(tidyv_cuisine_top_counts$cuisine)[levels(tidyv_cuisine_top_counts$cuisine)=="Latin (Cuban, Dominican, Puerto Rican, South & Central American)"] <- "Latin" # for better display
tidyv_cuisine_top_counts <- tidyv_cuisine_top_counts %>% gather(key=GradeType,value=Percentage,-cuisine,-total_count)
tidyv_cuisine_top_A <- arrange(tidyv_cuisine, desc(`A`))[1:20,]
tidyv_cuisine_top_A$cuisine <- factor(tidyv_cuisine_top_A$cuisine,
levels=tidyv_cuisine_top_A$cuisine[order(tidyv_cuisine_top_A$A)])
tidyv_cuisine_top_A <- tidyv_cuisine_top_A %>% gather(key=GradeType,value=Percentage,-cuisine,-total_count)
tidyv_cuisine_top_C <- arrange(tidyv_cuisine, desc(`C`))[1:20,]
tidyv_cuisine_top_C$cuisine <- factor(tidyv_cuisine_top_C$cuisine,
levels=tidyv_cuisine_top_C$cuisine[order(tidyv_cuisine_top_C$C)])
levels(tidyv_cuisine_top_C$cuisine)[levels(tidyv_cuisine_top_C$cuisine)=="Latin (Cuban, Dominican, Puerto Rican, South & Central American)"] <- "Latin"
tidyv_cuisine_top_C <- tidyv_cuisine_top_C %>% gather(key=GradeType,value=Percentage,-cuisine,-total_count)
ggplot(tidyv_cuisine_top_counts, aes(x=cuisine,y=Percentage,fill=GradeType)) +
geom_bar(stat="identity") +
coord_flip()+labs(title="Top 20 Common Restaurants' Grade Percentages")
ggplot(tidyv_cuisine_top_A, aes(x=cuisine,y=Percentage,fill=GradeType)) +
geom_bar(stat="identity") +
coord_flip()+labs(title="Top 20 A Percentages")
ggplot(tidyv_cuisine_top_C, aes(x=cuisine,y=Percentage,fill=GradeType,order_by())) +
geom_bar(stat="identity") +
coord_flip()+labs(title="Top 20 C Percentages")
The first plot is the grade percentages for the top 20 types of restaurants with the highest total grade counts. The second plot is the grade percentages for the top 20 restaurant types that received the highest percentages of A. And the third plot is the grade percentages for the top 20 restaurant types that received the highest percentages of C.
We can see that among the restaurant types with the top total counts, Cafe, Hamburgers and Donuts are the top 3 in receiving A’s and all of them also made the top 20 list of receiving A’s among all restaurant types. We can also see that there are several that have 100% A’s, including Cajun, Czech, Fruits/Vegetables, Not Listed/Not Applicable, Nuts/Confectionary, Scandinavian and Soups. However, if we look closer on the tables above, these types of restaurants actually have a very small sample size (<=30). Therefore, it’s possible that them having very few samples made it less likely to have that one bad sample in order to achieve 100% A’s. Among the top 20 C list, we see that Polynesian, Iranian, and Bangladeshi made the top 3. However, if we look at the table again, we see that both Polynesian and Iranian have a sample less than 10. So again, this tells us that they may not be representative. And in general, the top A list consists of European food while the top C list is a mixture of Latin, Asian, African, Caribbean, Australian, and Spanish restaurants.
In the following sections, we will dive deeper into different violations, scores, and some of our most frequented restaurants.
## [1] 354841
Combining the cuisine descriptions
df$cuisine <- factor(df$cuisine)
levels(df$cuisine) <- sub("Pizza/Italian", "Italian", levels(df$cuisine))
levels(df$cuisine) <- sub("Pizza", "Italian", levels(df$cuisine))
levels(df$cuisine) <- sub("CafÃ\u0083©/Coffee/Tea", "Cafe", levels(df$cuisine))
levels(df$cuisine) <- sub("Fruits/Vegetables", "Salads",levels(df$cuisine))
levels(df$cuisine) <- sub("Hotdogs/Pretzels", "Hotdogs", levels(df$cuisine))
levels(df$cuisine) <- sub("Ice Cream, Gelato, Yogurt, Ices", "Dessert", levels(df$cuisine))
levels(df$cuisine) <- sub("Juice, Smoothies, Fruit Salads", "Salads", levels(df$cuisine))
levels(df$cuisine) <- sub("Latin (Cuban, Dominican, Puerto Rican, South & Central American)",
"Latin", levels(df$cuisine))
levels(df$cuisine) <- sub("Sandwiches/Salads/Mixed Buffet", "Sandwiches", levels(df$cuisine))
levels(df$cuisine) <- sub("Bottled beverages, including water, sodas, juices, etc.", "Dessert", levels(df$cuisine))
levels(df$cuisine) <- sub("Not Listed/Not Applicable", "Other", levels(df$cuisine))
levels(df$cuisine) <- sub("Latin (Cuban, Dominican, Puerto Rican, South & Central American)", "Latin", levels(df$cuisine))
levels(df$cuisine) <- sub("Soups & Sandwiches", "Sandwiches", levels(df$cuisine))
Because some of the restaurant types are too specific and contain special characters, we decide to combine some of the categories to make it more simple and interpretable. As a result, all the restaurants with restaurant types that include word Pizza are categorized as Italian restaurants; The stores that sell ice cream, drinks are categorized as Cafe; Café/Coffee/Tea is replaced with care. The restaurants with the restaurant type of not listed/not applicable are categorized as others, etc.
df$score <- as.numeric(df$score)
ggplot(df, aes(score)) + geom_histogram(binwidth = 5, boundary = 0, color = "black", fill = "blue") + labs(title = "Distribution of scores")
ggplot(df, aes(score)) + geom_histogram(binwidth = 5, boundary = 0, color = "black", fill = "blue") + labs(title = "Distribution of scores based on borough") +
facet_grid(boro~., scale = "free")
The plot shown above describes the overall pattern of the scores. It is shown that the distribution of scores is skewed to the right. While it sounds more reasonable that the scores should be non-negative, some of the restaurants receive negative score, causing the histogram to start at negative values. The scores that the restaurants mostly received for any type of violations are between 10 to 20. Also, there exist quite a lot of restaurants that receive a score higher than 25 for any type of violations. Because the score above 25 is associated with grade C, which is bad, we could probably consider such restaurants as not cleaned and not well maintained.
Also, we plot the distribution of scores based on boroughs to explore the relationship between borough and scores. Because we are interested in the distribution for each borough, we set the scale to free to make sure the distributions for some boroughs do not shrink. While there is a slight change in the distribution of scores across boroughs, they are all highly skewed to the right, and the shape of the distributions looks almost the same. This shows that there is no or little relationship between the boroughs and the scores.
# grade distribution for each violation code
violation_df <- df %>% select(violation_code, grade) %>% group_by(violation_code, grade) %>% summarize(count = n())
violation_df <- violation_df %>% filter(!is.na(violation_code))
violation_df = df %>% select(violation_code) %>% group_by(violation_code) %>% summarize(count = n()) %>% arrange(-count)
ggplot(violation_df, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = "Distribution of violation code", x = "violation code")
violation_df_top10 <- violation_df[1:10, ]
violation_des <- violation[violation$VIOLATION.CODE %in% violation_df_top10$violation_code, ]
violation_des$VIOLATION.CODE <- factor(violation_des$VIOLATION.CODE, levels = violation_df_top10$violation_code)
violation_des <- violation_des %>% arrange(VIOLATION.CODE)
print(violation_des)
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 04L
## 4 06C
## 5 06D
## 6 02G
## 7 10B
## 8 02B
## 9 04N
## 10 04H
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 4 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 5 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 6 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 8 Hot food item not held at or above 140Ã\u0082º F.
## 9 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10 Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.
ggplot(violation_df_top10, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = "Distribution of top 10 violation code", x = "violation code")
The most commonly violated violation type is 10F, ‘Non-food contact surface improperly constructed’. While this violation seems not too serious, the 08A and 04L are associated with vermin and mice, which sound more disgusting and serious. Moreover, 06C is the violation code that is ‘Food not protected from potential source of contamination during storage, preparation, transportation, display or service.’ Learning that there are restaurants in New York which have issue with vermin and mice while not properly storing and protecting foods from any potential source of contamination may suggest us to be very careful and picky when deciding which restaurant to go.
#getting the number of restaurants for each restaurant type
distinct_res <- df %>% select(cuisine, name) %>% distinct(cuisine, name) %>% mutate(count = 1)
distinct_res <- distinct_res %>% select(cuisine, count) %>% group_by(cuisine) %>% summarise(total_num = sum(count)) %>% arrange(-total_num)
ggplot(distinct_res, aes(reorder(cuisine, total_num), total_num)) + geom_col() + coord_flip() +
labs(title = "Number of restaurants", x = "Cuisine")
top5res <- distinct_res[1:5, ]
ggplot(top5res, aes(reorder(cuisine, total_num), total_num)) + geom_col() + coord_flip() +
labs(title = "Number of restaurants", x = "Cuisine")
The bar chart shows the number of restaurants for each restaurant type. Because there are more than 50 restaurant types in this data, we decide to focus on 5 most common restaurant types. The top 5 restaurant types are following: American, Italian, Chinese, Cafe, and Latin. As shown in the plot above, the most common restaurant type is American (5310), followed by Italian (2224) and Chinese (2088).
# getting the information associated with 5 most common restaurant types
top5 <- df[df$cuisine %in% top5res$cuisine, ]
top5 <- na.omit(top5)
top5$cuisine <- factor(top5$cuisine)
# group by violation code to see the most commonly violated violation codes
top5_violations = top5 %>% select(violation_code) %>% group_by(violation_code) %>% summarize(count = n()) %>% arrange(-count)
ggplot(top5_violations, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = "Distribution of violation code", x = "violation code")
violations_10mostcommon <- top5_violations[1:10, ]
new_violation <- violation[violation$VIOLATION.CODE %in% violations_10mostcommon$violation_code, ]
new_violation$VIOLATION.CODE <- factor(new_violation$VIOLATION.CODE, levels = violations_10mostcommon$violation_code)
new_violation <- new_violation %>% arrange(VIOLATION.CODE)
print(new_violation)
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 06D
## 4 04L
## 5 06C
## 6 02G
## 7 10B
## 8 02B
## 9 04N
## 10 04H
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 4 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 6 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 8 Hot food item not held at or above 140Ã\u0082º F.
## 9 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10 Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.
ggplot(violations_10mostcommon, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = "Distribution of 10 most common violation code for 5 most common restaurant types", x = "violation code")
We only focus on the restaurants from 5 most common restaurant types to explore what type of violations the restaurants made the most. The above plot shows 10 most common violations made by these restaurants. The violation code that most frequently appears is 10F, which is ‘Non-food contact surface improperly constructed. Unacceptable material used’ While this violation does not sound very critical, 04L is associated with mice, which sounds more serious and critical. Out of 93 violations, this one has ranked at top 4 and is violated 13788 times. Because the above plot indicates the overall frequency of violations, we decide to explore the frequency of the violation codes for specific restaurant type. Again, as discussed above, we focus only on the violations made by restaurants from 5 most common restaurant types.
# this funciton takes the string (restaurant type) as an input and returns a bar chart and a dataframe that contains the information about the violation codes and the corresponding violation descriptions
violation_code_with_restype <- function(x){
new_df <- df %>% filter(cuisine == x) %>% select(violation_code) %>% group_by(violation_code) %>%
summarize(count = n()) %>% arrange(-count) %>% top_n(10)
new_violation <- violation[violation$VIOLATION.CODE %in% new_df$violation_code, ]
new_violation$VIOLATION.CODE <- factor(new_violation$VIOLATION.CODE, levels = new_df$violation_code)
new_violation <- new_violation %>% arrange(VIOLATION.CODE)
print(new_violation)
ggplot(new_df, aes(reorder(violation_code, count), count)) + geom_col() + coord_flip() +
labs(title = paste("Distribution of violation code of", x, "restaurants", sep = " "), x = "violation code")
}
violation_code_with_restype("American")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 06D
## 4 06C
## 5 02G
## 6 04L
## 7 10B
## 8 04N
## 9 02B
## 10 04H
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 4 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 5 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 6 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 7 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 8 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 9 Hot food item not held at or above 140Ã\u0082º F.
## 10 Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan.
We first focus on the violation types that American restaurants made. In order to produce a bar chart of 10 most common violation made by these restaurants, the steps we take are following: 1. we first filter out the rows with restaurant type = American 2. Then, we group the dataframe by the violation code and then count the number of each violation has made 3. we arrange the dataframe by the number of the occurrence and then extract top 10 violation codes to produce a bar chart The following steps are used to draw a plot for any restaurant type described below.
While the frequency of the violation code changes (because we are only looking at one specific type of the restaurants), the most commonly violated violation types stay almost the same. The 4 most common violation types are same as the plot shown before this one. Though, these restaurants seem to violate 02G, which associates with the storing cold food item above 41 Farenheit, more often than 10B, which associates with not properly installing or maintaining plumbing. Also, they tend to violate the rule associated with storing and maintaining foods as their 10 most violated codes contain the one associated with ‘Raw, cooked or prepared food is adulterated, contaminated, cross-contaminated, or not discarded in accordance with HACCP plan’
violation_code_with_restype("Italian")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 04L
## 4 06D
## 5 02G
## 6 02B
## 7 06C
## 8 10B
## 9 04N
## 10 04A
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 4 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 5 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 6 Hot food item not held at or above 140Ã\u0082º F.
## 7 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 8 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 10 Food Protection Certificate not held by supervisor of food operations.
Next, we decide to explore the violation types the Italian restaurants have made. Comparing all three plots, it seems that the restaurants of different types tend to make similar violations. The 10 most common violation types for Italian restaurants are exactly same as the American restaurants except that the orders of violation codes (in terms of frequency) changes. The Italian restaurants in New York area seem to violate the rule associated with installing and maintaining plumbing properly more than the American restaurants in New York. Moreover, compared to the overall frequency of violation codes, these two types of restaurants seem to have more issue with mice.
violation_code_with_restype("Chinese")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 06C
## 4 04L
## 5 02B
## 6 02G
## 7 06D
## 8 10B
## 9 04M
## 10 04N
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 4 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5 Hot food item not held at or above 140Ã\u0082º F.
## 6 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 8 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9 Live roaches present in facility's food and/or non-food areas.
## 10 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
The 10 most common violation types for Chinese restaurants seem to be slightly different from those of the previous two restaurant types. The violation of 04L (Evidence of mice or live mice present in facility’s food and/or non-food areas) is more frequent (in terms of relative frequency). Moreover, the 04M code is violated more than 1000 times and is ranked at top 9. This one is associated with live roaches present in facility’s food and/or non-food areas. So far, our results on the different types of restaurants show us that it would probably be a great idea to look at this report first to avoid the restaurants with roaches and mice.
violation_code_with_restype("Cafe")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 06D
## 4 06C
## 5 10B
## 6 02G
## 7 04L
## 8 04N
## 9 04A
## 10 10H
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 4 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 5 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 6 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 7 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 8 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 9 Food Protection Certificate not held by supervisor of food operations.
## 10 Proper sanitization not provided for utensil ware washing operation.
violation_code_with_restype("Latin (Cuban, Dominican, Puerto Rican, South & Central American)")
## Selecting by count
## VIOLATION.CODE
## 1 10F
## 2 08A
## 3 02B
## 4 04L
## 5 06C
## 6 04N
## 7 02G
## 8 10B
## 9 06D
## 10 04M
## VIOLATION.DESCRIPTION
## 1 Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, above and underneath the unit.
## 2 Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.
## 3 Hot food item not held at or above 140Ã\u0082º F.
## 4 Evidence of mice or live mice present in facility's food and/or non-food areas.
## 5 Food not protected from potential source of contamination during storage, preparation, transportation, display or service.
## 6 Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility\032s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies.
## 7 Cold food item held above 41Ã\u0082º F (smoked fish and reduced oxygen packaged foods above 38 Ã\u0082ºF) except during necessary preparation.
## 8 Plumbing not properly installed or maintained; anti-siphonage or backflow prevention device not provided where required; equipment or floor not properly drained; sewage disposal system in disrepair or not functioning properly.
## 9 Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.
## 10 Live roaches present in facility's food and/or non-food areas.
The bar chart of the violation codes for Latin restaurants looks almost the same as other three types of restaurants. The violation associated with the evidence of mice or live mice present in facility’s food and/or non-food areas have frequently made. For cafes, the regulation associated with living mice is less freuquently violated compared to other restaurant types. Latin restaurants have the issue of roaches as well. The Cafes tend to violate the code associated with sanitizing utensils and washing food contact surface more often. Though, compared to the overall (containing all the restaurants) frequency of the violation codes, the code associated with the mice one is ranked lower for these types of restaurants.
Also, we decide to study the association between the violation codes and the years. We still focus on the 5 most common restaurants to keep it simple and more interpretable. In order to produce a plot for each year, we take the same steps as described above except that we create new columns called year and month which extract year and month information from inspection date using mutate function.
Additional note: we decide to focus on 08A, 06D, 04L, and 06C only. While including all the violation codes provide more information about what overall violation trends across the year, this can make the graph more complicated and hard to look at. So, to make it simple and more readable, we choose 08A, 06D, 04L, and 06C as our x variable to see if there are any trends in these violation types. Again, because we are most interested in the shape of the distribution for each type of restaurant and each year, we set the scale to free to prevent some of the graphs from shrinking towards 0.
violation_type <- c("08A", "06D", "04L", "06C")
top5_year <- top5 %>% filter(violation_code %in% violation_type) %>% mutate(year = factor(format(inspection_date, "%Y")), month = factor(format(inspection_date, "%m"))) %>% select(cuisine, year, violation_code) %>% group_by(cuisine, year, violation_code) %>% summarise(count = n())
ggplot(top5_year, aes(x = violation_code, y = count, fill = year)) +
geom_col(position = "dodge") +
facet_grid(cuisine~., scale = "free") +
ggtitle("Trends in frequency of violation types") +
theme_grey(16)
One noticeable feature from this graph is that there are not much data in 2013. In fact, it is also shown in the dataframe that the frequency of any violation types for 2013 is less than 10. The frequency of 08A is highest in the year of 2017 for American, Cafe, Chinese, and Italian. Only the plot of Latin restaurants has the highest frequency of 08A in 2015. There is increasing pattern in any violation code for Chinese restaurants and Cafe. All of the violation codes reach a peak in 2017. It then decreases significantly in 2018, but it would probably be because the inspections has not finished yet for 2018. While there is an increasing pattern in 04L, 06D, and 08A for Italian and American restaurants, it seems like the frequency of the violations of 06C (Food not protected from potential source of contamination during storage, preparation, transportation, display or service) has decreased after 2016. For Latin restaurants, the frequency of all four type of violations increase at 2015 and then decrease after 2016. Also, the Latin restaurants are the only one which has a decreasing trend in 04L that is associated with mice.
violation_code_restype <- function(df, x){
new_df <- df %>% filter(cuisine == x) %>% select(violation_code) %>% group_by(violation_code) %>%
summarize(count = n()) %>% arrange(-count) %>% top_n(10)
return(new_df)
}
american <- df %>% violation_code_restype("American") %>% arrange(count)
## Selecting by count
italian <- df %>% violation_code_restype("Italian") %>% arrange(count)
## Selecting by count
chinese <- df %>% violation_code_restype("Chinese") %>% arrange(count)
## Selecting by count
cafe <- df %>% violation_code_restype("Cafe") %>% arrange(count)
## Selecting by count
latin <- df %>% violation_code_restype("Latin (Cuban, Dominican, Puerto Rican, South & Central American)") %>% arrange(count)
## Selecting by count
p <- plot_ly(type = 'bar', orientation = 'h') %>%
add_trace(x = italian$count, y = italian$violation_code, name = 'italian', visible = F) %>%
add_trace(x = chinese$count, y = chinese$violation_code, name = 'chinese', visible = F) %>%
add_trace(x = cafe$count, y = cafe$violation_code, name = 'Cafe', visible = F) %>%
add_trace(x = latin$count, y = latin$violation_code, name = 'latin', visible = F) %>%
add_trace(x = american$count, y = american$violation_code, name = 'american', visible = T) %>%
layout(
xaxis = list(title=""),
yaxis = list(title=""),
showlegend = FALSE,
updatemenus = list(
list(
buttons = list(
list(method = 'restyle',
args = list("visible", list(T,F,F,F,F)),
label = 'American'),
list(method = 'restyle',
args = list("visible", list(F,T,F,F,F)),
label = 'Italian'),
list(method = 'restyle',
args = list("visible", list(F,F,T,F,F)),
label = 'Chinese'),
list(method = 'restyle',
args = list("visible", list(F,F,F,T,F)),
label = 'Cafe'),
list(method = 'restyle',
args = list("visible", list(F,F,F,F,T)),
label = 'Latin')
)
)
)
)
p
## Warning: Can't display both discrete & non-discrete data on same axis
# Top 10 Resteraurants Ranked by Number of Critical Violation
# Regardless of type of violation
# group data
restaurant <- inspecttbl%>%group_by(INSPECTION.DATE, DBA, SCORE, GRADE, CRITICAL.FLAG,BORO, BUILDING, STREET, ZIPCODE)
# renaming DBA(doing business as usual) to restaurant
restaurant<- restaurant%>%rename( RESTAURANT = DBA)
# Count voilations according to restaurant and type of violation [~distinct] count the instances regardless of violation type per score, per restaurant, per date
count<-restaurant%>%summarise(count=n())
# select only critical violation
citical_count<-count%>%filter(CRITICAL.FLAG=='Critical')
# Find the worest restaurants by calculating the number of that violate with Grade C under critical situation
citical_count<-citical_count%>%filter(SCORE>27)%>%group_by(RESTAURANT,STREET)%>%mutate(violation_count=sum(count))
# order the data # restaurant the receives violates grade C most
res_critical_vio_count<-distinct(citical_count%>%select(STREET,RESTAURANT,violation_count)%>%arrange(desc(violation_count)))
result<-res_critical_vio_count[1:10,c(2,3)]
result$RESTAURANT<-reorder(result$RESTAURANT,result$violation_count)
g<-ggplot(result, aes(RESTAURANT, violation_count)) +
geom_bar(stat = 'identity',fill="tan2") +
coord_flip()+
#xlabel,ylabel
labs(title="10 Worst Resteraurants Ranked by Number of \n Critical Violation ",x = "", y = "Critical Violation Count")
g
#ggplotly(g,tooltip=c("x", "y"))
We want to evaluate the restaurants in NYC inspection and find the worst restaurants in NYC. We define two ways to evaluate how bad the restaurant is, by violation incidents and by score of those violation. For the first method, we count the number of violation incident with grade C under critical condition [CRITICAL.FLAG==‘Critical’] for each restaurant. Across all records, we plot the 10 restaurants with highest violation count. Overall, there are 5 restaurants have critical violation counts more than 40. Note that the first one ‘Gang San Deul’ receives over 45 violation that is marked critical. We can see that some of them are Asian Cuisine from the restaurant names.
# restaurant have the highest violation score
# Define bad restaurants as violation Score > 27 [Grade C] ~ 5 occurance of Grade C
citical_count_score<-citical_count%>%
filter(SCORE>27,count>5)%>%arrange(desc(SCORE))
result<-citical_count_score[1:10,c(2,3,10)]
result$RESTAURANT<-reorder(result$RESTAURANT,result$SCORE)
g<-ggplot(result, aes(RESTAURANT, SCORE)) +
geom_bar(stat = 'identity',fill="tan2") +
coord_flip()+
#theme(text = element_text(size=10), axis.text.x = element_text(angle=45, vjust=1, color = "grey")) +
labs(title="Worst 10 Highest Violation Score Restaurant ",x = "", y = "Score")
#ggplotly(g)
g
# save the csv for google api
# csv<-citical_count_score
# write.csv(csv[1:15,c(1,4,2,3,10)], file = '/Users/Selina/Desktop/Visualization/EDAV-master/chart.csv',row.names = FALSE)
For the second method, we want to give particular emphasis on the times of violation occurs for a restaurant during one inspection. We take into account of violations that have grade C [with score > 27] and have occurred more than 5 times for a restaurant. In the plot, we find that all of the restaurants have score above 100. In other words, these restaurants have at least 5 violation scored larger than 100 at the same date of inspection! Note that ’Margarita Island’ have the highest violation score [151] among all the restaurants.
Google chart is a nice API for visualization in javascript, base on the what we learned this semester on d3, we decide to explore this API a bit. Click here to see the interactive google chart version of the plot.
#### INSPECTION.TYPE Violation related####
# we focus on instore sanity check of the inspection#
# dis-regarding rows from initial inspections as grades and scores are not counted
gradables <- c("Cycle Inspection / Initial Inspection" , "Cycle Inspection / Re-inspection",
"Pre-permit (Operational) / Initial Inspection", "Pre-permit (Operational) / Re-inspection")
inspecttbl1 <- (filter(inspecttbl, `INSPECTION.TYPE` %in% gradables))
# critical score and inspection type of chain restaurant
# store per year average critical score variation
cafe<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(year = format(INSPECTION.DATE, "%Y"))
STARBUCKS<-cafe[grep("STARBUCKS",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
STARBUCKS$DBA<-"STARBUCKS"
DUNKIN<-cafe[grep("DUNKIN*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
DUNKIN$DBA<-"DUNKIN"
PICCOLO<-cafe[grep("PICCOLO*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
PICCOLO$DBA<-"PICCOLO"
GREGORY<-cafe[grep("GREGORY'S COFFEE*",cafe$DBA),]%>%group_by(year,BORO)%>%summarise(MEAN = mean(SCORE))
GREGORY$DBA<-"GREGORY"
score<-rbind(STARBUCKS,DUNKIN,GREGORY,PICCOLO)
score_boro<-score%>%filter(BORO=="MANHATTAN"|BORO=="QUEENS"|BORO=="BROOKLYN")
score_boro$DBA<-as.factor(score_boro$DBA)
csv<-score_boro
# change to local path
# write.csv(csv, file = 'YOURPATH',row.names = FALSE)
#write.csv(csv, file = '/Users/Selina/Desktop/Visualization/EDAV-master/boro_mean.csv',row.names = FALSE)
# csv <- read_csv("YOURPATH")
#csv <- read_csv("/Users/Selina/Desktop/Visualization/EDAV-master/boro_mean.csv")
g<-ggplot(csv, aes(x = year, y = MEAN, fill = DBA)) +
geom_col(position = "dodge") +
facet_grid(fct_relevel(score_boro$BORO,"MANHATTAN")~.) +
ggtitle("Mean Violation Score") +
theme_grey(16)
g+ scale_colour_colorblind()
ggplotly(g,tooltip=c("x", "y"))
We gather ‘Cafe’ category from all the restaurant types, and want to see the violation score difference among the cafes.
In this section, we want to examine how well those chain restaurants did in food safety. As mentioned by Julie, inspection types varies across the record. Again, we eliminated some of the inspections types as they are irrelevant of our goal, such as Trans Fat, Calorie type, Compliance, Special Program, and Administrative violations inspections.
Among all the Cafe, we pick out the 4 chain stores across U.S., including STARBUCKS, DUNKIN, GREGORY, and PICCOLO and evaluate the quality of those chain cafes store in NY. Sine all 4 of them have many local stores in NYC. It is easy to visualize according to the mean violation score of each base on Borough.
Among these 4 Cafes, GREGORY and PICCOLO stores are only located in Manhattan. And PICCOLO started to be recorded after 2015. Looking at the plot of Manhattan,
For Brooklyn and Queens,
# CAFE common violation type
# edit
STARBUCKS_type<-inspecttbl1[ grep("STARBUCKS",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/2431)%>%arrange(desc(vio_freq))
STARBUCKS_type$DBA<-"STARBUCKS"
interm<-inspecttbl1[ grep("DUNKIN*",inspecttbl1$DBA),]
interm$DBA<-"DUNKIN"
DUNKIN_type<-interm%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/6004)%>%arrange(desc(vio_freq))
DUNKIN_type$DBA<-"DUNKIN"
PICCOLO_type<-inspecttbl1[ grep("PICCOLO*",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/339)%>%arrange(desc(vio_freq))
PICCOLO_type$DBA<-"PICCOLO"
GREGORY_type<-inspecttbl1[ grep("GREGORY'S COFFEE*",inspecttbl1$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n()/98)%>%arrange(desc(vio_freq))
GREGORY_type$DBA<-"GREGORY"
type<-rbind(STARBUCKS_type[1:5,], DUNKIN_type[1:5,],PICCOLO_type[1:5,],GREGORY_type[1:5,])
type$DBA<-factor(type$DBA)
type<- type%>%rename( ViolationCode = VIOLATION.CODE,frequency=vio_freq)
g<-ggplot(type, aes(x = ViolationCode, y = frequency)) +
geom_col(position = "dodge",fill="tan2") +
facet_grid(~ fct_relevel(type$DBA,"STARBUCKS"),scales = 'free')+
ggtitle("Most Frequent Violation Code for Cafes") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))+
theme_grey(16)
g<- ggplotly(g,tooltip=c("x", "y"))
g
After seeing how mean violation score of those chain cafe varying according to year and borough, we look at the most frequent violation code for each of them.
10F ‘Non-food contact surface improperly constructed’ appears in Top 5 most frequent violation for every Cafe. It makes sense since it is the most common violation in NYC. Notice that in STARBUCKS, it really stands out from all the violation and count for about 30% of all of STARBUCKS stores records. The result is good, since the violation is relatively trivial comparing to 04 series categories (eg. 04N the evidence of flies etc.) Whereas in DUNKIN and GREGORY, 10F counts for over 15% of the violation records.
In the case of PICCOLO [which have the highest violation score], the top 5 violation only counts very small portions of overall violation.
# select code c('10F','10B','08A')
# 08A vermin
# 04N flies
code<-inspecttbl1%>%filter(VIOLATION.CODE=='10F'|VIOLATION.CODE=='10B'|VIOLATION.CODE=='08A')%>%mutate(year = format(INSPECTION.DATE, "%Y"),month = format(INSPECTION.DATE, "%Y"))
STARBUCKS_type<-code[ grep("STARBUCKS",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
STARBUCKS_type$DBA<-"STARBUCKS"
DUNKIN_type<-code[ grep("DUNKIN*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
DUNKIN_type$DBA<-"DUNKIN"
PICCOLO_type<-code[ grep("PICCOLO*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
PICCOLO_type$DBA<-"PICCOLO"
GREGORY_type<-code[ grep("GREGORY'S COFFEE*",code$DBA),]%>%group_by(DBA,VIOLATION.CODE)%>%summarise(vio_freq=n())%>%arrange(desc(vio_freq))
GREGORY_type$DBA<-"GREGORY"
code<-rbind(STARBUCKS_type[1:5,], DUNKIN_type[1:5,],PICCOLO_type[1:5,],GREGORY_type[1:5,])
code$DBA<-factor(code$DBA)
code$DBA<-fct_relevel(code$DBA,"STARBUCKS")
# code$DBA<-fct_relevel(code$DBA,"PICCOLO", )
mosaic(VIOLATION.CODE ~ DBA, code, direction = c("v", "h"),labeling= labeling_border(rot_labels = c(15,0,0,0)),main="Violation Code Ratio of Cafes") #gp = gpar(fill = c("blue", "lightblue"),rot_labels=c(0,90,0,0),just_labels="right" )
#, labeling= labeling_border(rot_labels = c(0,90,0,0),
# just_labels = c("center",
# "center",
# "center",
# "right"))
We find the 10F[non-food contact surface improperly constructed], 10B[improper sewage disposal system], 08A[Facility not vermin proof and have potential to attract vermin] are the most frequent violation code combination across these 4 stores.
A close look at their distribution by store would be interesting. From the plot, we can see that STARBUCKS and DUNKIN has very similar violation ratio distribution, whereas GREGORY have more violation in 10B , and less in 10F and 08A. PICCOLO has a significant increase in 08A.
# per year/date
yearmonth<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(yearmonth = format(INSPECTION.DATE, "%Y-%m"))
STARBUCKS<-yearmonth[grep("STARBUCKS",yearmonth$DBA),]%>%group_by(yearmonth)%>%filter(VIOLATION.CODE=="08A")%>%summarise(Score = mean(SCORE))
STARBUCKS$Date<-as.Date(as.yearmon(STARBUCKS$yearmonth, "%Y-%m"))
g<-ggplot(STARBUCKS) + geom_line(aes(Date, Score)) +
geom_point(aes(Date, Score))+
ggtitle("STARBUCKS Violation Score 08A Monthly Mean Score Variation") +
labs (x = "Month", y = "Score") +
theme_grey(16) +
theme(legend.title = element_blank())
p <- ggplotly(g, tooltip=c("x", "y"))
p
As Lisa has shown in the previous section, 10F and 08A are two most frequent violation in NYC. We decided to look at 08A violation particular in STARBUCKS since it represents the dimension we cared about more or more serious in food safety issue of the city - ‘potential presence of vermin (eg. mice).’
The time series plot according to ‘potential vermin presents [08A]’ violation shows the monthly change of score for STARBUCKS the past 5 years. There is a general pattern of decreasing in score across years. And the variation of score also decreases, meaning that all STARBUCKS in NYC are reaching more and more consistent food safety quality maybe through stricter regularization.
# analysis of area
CAFE<-inspecttbl1%>%filter(CUISINE.DESCRIPTION=="Café/Coffee/Tea")%>%mutate(year = format(INSPECTION.DATE, "%Y"))
STARBUCKS<-CAFE[grep("STARBUCKS",CAFE$DBA),]%>%group_by(BORO,year)%>%filter(VIOLATION.CODE=="08A")%>%summarise(Score = mean(SCORE))
STARBUCKS<-STARBUCKS%>%filter(year!="2018",year!="2014")
STARBUCKS$Boro<-reorder(STARBUCKS$BORO,STARBUCKS$Score)
g<-ggplot(STARBUCKS) +
geom_bar(aes(Boro, Score),stat = 'identity',fill="tan2") +
facet_grid(~year)+
labs(title="Mean Violation Code 08A By BORO ",x = "NYC Borough", y = "08A Violation Score")+
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))
g <- ggplotly(g,tooltip=c("x", "y"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
g
We are wondering if this violation score is different across different area of NYC as well. The Mean Violation Code By Boro shows the following trend. Since the data in 2014 and 2018 are not recorded full-year around. We take 2015, 2016, 2017 to analysis the overall trend.